The predictive validity of cognitive ability and personality traits was examined in large samples of
U.S. Air Force pilot trainees. Criterion data were collected between 1995 and 2008 from 4 training
bases across 3 training tracks. Analyses also examined consistency in pilot aptitude and training
outcomes. Results were consistent with previous research indicating cognitive ability is the best
predictor of pilot training performance. There were few differences across training tracks, bases, and
years, and none was large. Overall, results illustrated the consistency of the quality of pilot trainees
as assessed by cognitive ability and personality trait measures, and the consistency of these
measures in predicting training performance over time. This consistency results in a more stable training
system, enabling greater efficiency and effectiveness.


The selection and training of military pilots is paramount to the success of the pilots and the
military mission. The selection of military pilot trainees is a critical task. Pilots are not only
highly valued; they are also expensive to train. The dollar costs of training are high and the risks to
life and property are great. Therefore, it is important to ensure that the quality of pilot candidates
remains high and stable over time, permitting pilot training to be as efficient and effective as
possible. This article examines the predictive validity of cognitive ability and personality measures
for U.S. Air Force (USAF) pilot trainees and the consistency of these relations across training
tracks, bases, and time.


BACKGROUND 

The training of USAF pilots takes place in phases and at several different locations. Some of
these locations also train pilots for other military services, both U.S. and international. For
example, U.S. Navy aviators and European or other international military pilots train at USAF
facilities. Pilot training consists of three phases: academic classes and preflight training, primary
aircraft training, and advanced aircraft training. Academic and preflight training course content
includes aerospace physiology; ejection seat, egress, and parachute landing; aircraft systems;
instruments; mission planning; navigation; and weather. Primary and advanced (fighter/bomber
or airlift/tanker) aircraft training is designed to teach flying skills with a focus on combat,
instruments, formation, and navigation. Although each training location follows roughly the
same training syllabus to ensure coverage of common knowledge, skills, and abilities required
for success, there are differences, with the Euro-NATO Joint Jet Pilot Training (ENJJPT)
program at Sheppard Air Force Base being the most divergent (King & Lochridge, 1991). The
ENJJPT program is focused on training of combat pilots. Unlike Specialized Undergraduate Pilot
Training (SUPT), which is taught at Columbus, Laughlin, and Vance Air Force Bases, ENJJPT
has no airlift/tanker advanced training track. Also, ENJJPT students receive more hands-on
flying hours in both the Primary and Advanced T-38 phases than those attending SUPT (see
http://www.baseops.net/militarypilot/). A more detailed description of primary and advanced training
is provided in the Method section.

The primary purpose of this study was to examine the predictive validity of cognitive
ability and personality across three training tracks and four training bases over a 14-year period.
Determining the generalizability of the predictive validity of these constructs is important, as
they have been mainstays in pilot selection batteries for many years (Carretta & Ree, 2003).
A secondary purpose was to examine the consistency of pilot trainee quality and training
performance across training tracks, bases, and time period. Maintaining a consistently high level of pilot
trainee quality and training performance over time is crucial to ensuring the stability and
effectiveness of the Air Force. Consistency should mean fewer changes and costs due to changes. Pilot
trainee quality was measured using standardized tests of cognitive ability and personality traits.
Training performance was measured using a composite of flying grades developed by USAF Air
Education and Training Command (AETC).


USAF  PILOT  CANDIDATE  SELECTION  METHODS 

All USAF pilot training applicants must pass the rigorous Class I flight physical standards (U.S.
Air Force, 2011) to be eligible for selection. Medically qualified applicants are evaluated for
training suitability on measures of officership and aptitude (Weeks & Zelenski, 1998). USAF
Academy cadets are evaluated by Academy faculty and staff who consider academic, military,
and physical performance. Applicants commissioned through the Reserve Officer Training Corps
(ROTC) or Officer Training School (OTS) are administered the Air Force Officer Qualifying
Test (AFOQT; Drasgow, Nye, Carretta, & Ree, 2010) and Test of Basic Aviation Skills (TBAS;
Carretta, 2005). A measure of pilot training aptitude, the Pilot Candidate Selection Method
(PCSM; Carretta, 2011) score, is created by combining the AFOQT Pilot composite, several
TBAS subtest scores, and the total number of flying hours logged either as a student pilot or
as pilot in command¹ in a regression-weighted equation. For ROTC, medically qualified pilot
training applicants are ranked on an Order of Merit score based on the PCSM score, field
training, physical fitness, college grade-point average (GPA), and commander’s ranking. OTS pilot
training candidate selection uses the “whole person” concept. Each OTS pilot training board
member independently reviews the information in applicants’ folders and scores each applicant
in three areas: experience/leadership, education/aptitude, and potential/adaptability. If the scores
for an applicant are not consistent across board members, they discuss their scoring rationale until
a sufficient level of agreement has been reached. Regardless of commissioning source, a common
theme in pilot trainee selection procedures is high intelligence, whether it involves acceptance
into the USAF Academy, a high GPA, a high AFOQT score, or the impression a candidate makes
on a selection board.

¹ These are the number of flying hours in a Federal Aviation Administration logbook and do not include hours in a
flight simulator.

Medical  Flight  Screening 

In addition to the pilot trainee selection procedures already described, all candidates must
complete Medical Flight Screening (MFS; King & Flynn, 1995). The USAF MFS program screens
pilot candidates prior to SUPT. MFS includes ophthalmic and cardiac diagnostic procedures
as well as several psychological tests (King, Barto, Ree, & Teachout, 2011; King, Barto, Ree,
Teachout, & Retzlaff, 2011), including measures of cognitive ability (Multidimensional Aptitude
Battery [MAB; Jackson, 2003] and MicroCog) and personality (Revised NEO Personality
Inventory [NEO PI-R; Costa & McCrae, 1985] and Minnesota Multiphasic Personality
Inventory-2 [Butcher, Graham, Ben-Porath, Dahlstrom, & Kaemmer, 2001]) tests.

Cognitive tests. The primary purpose of the cognitive tests is to archive cognitive
functioning data for future use in idiographic assessments where an individual is compared to himself or
herself rather than to a collection of norms from a large population. The objective is to develop an
individual registry against which future testing might be compared. Test results are particularly
important for pilots seeking a waiver for return-to-flying status following an illness or injury that
might have resulted in cognitive impairment (Chappelle, Ree, Barto, Teachout, & Thompson,
2010). During an evaluation, performance on the cognitive tests is compared with baseline scores
collected prior to pilot training to determine whether any changes have occurred. Individualized
(pre-post) comparisons result in more reliable return-to-flying decisions as pilots typically are
very high in cognitive functioning, especially in comparison to general population norms, and
might remain so even after an injury or neurological event (King, 2012).

In addition to their clinical use, a recent study demonstrated that scores from the MAB and
MicroCog were useful in predicting performance on several pilot training performance
criteria including graduation or elimination from initial jet training and course grades (King et al.,
2013). These results were consistent with prior studies of the relations of cognitive ability to pilot
training performance (Carretta & Ree, 2003; Ree & Carretta, 1996).

Personality tests. The USAF does not use measures of personality for pilot training
selection. Measures of personality based on the Big Five model² (Goldberg, 1981) are administered
by the Aeromedical Consultation Service, USAF School of Aerospace Medicine, prior to entry
into pilot training. As with the MAB, these pretraining measures provide a baseline in
subsequent psychological assessments when pilots are being considered for return-to-flying duties
after receiving a medically disqualifying diagnosis. Archived personality test scores can be
compared to the pilot’s current functioning when seeking a waiver to the medical standards (U.S. Air
Force, 2011). The operational personality assessment tool is the NEO PI-R (Costa & McCrae,
1985), a Big Five measure that provides domain scores on Neuroticism, Extraversion, Openness
to Experience, Agreeableness, and Conscientiousness.

² The Big Five personality traits are five broad domains or dimensions used to describe human personality.
The domains are Neuroticism (sometimes called emotional stability), Extraversion, Openness, Agreeableness, and
Conscientiousness.

In one of the earliest reported studies of the use of personality tests for flying personnel, Sells
(1955) showed the utility of the personality constructs of “motivation to fly” and “expression
of anxieties about flying.” Siem (1992) demonstrated the predictive validity of the personality
constructs of hostility (r = -.12), self-confidence (r = .13), and values flexibility (r = .12) against
training completion in a sample of 509 USAF student pilots. Training graduates scored higher
on self-confidence and values flexibility and lower on hostility than did those who failed due to
flying training deficiency.

Anesgart  and  Callister  (2001)  examined  the  relationships  between  the  NEO  PI-R  Big  Five 
domain  scores  and  success  in  flying  training  in  a  high-wing,  propeller-driven  monoplane.  They 
reported  that  Neuroticism,  Extraversion,  and  Openness  were  related  to  self-elimination  from  the 
program.  Boyd,  Patterson,  and  Thompson  (2005)  reported  statistically  significant  differences 
between  the  scores  of  pilots  assigned  to  fly  airlift/tankers  versus  those  assigned  to  fly  fighters 
for  the  NEO  PI-R  domains  of  Agreeableness  and  Conscientiousness.  Fighter  pilots  had  lower 
levels  of  Agreeableness  and  higher  levels  of  Conscientiousness. 

Meta-analyses (Campbell, Castaneda, & Pulos, 2010; Hunter & Burke, 1994; Martinussen,
1996) have reported modest correlations between measures of personality and pilot training
performance. D. R. Hunter and Burke (1994) reported a small correlation (r = .10) for personality
as a predictor of flying training criteria. Martinussen (1996) reported a small correlation (r =
.14) for personality with training completion (pass-fail). More recently, Campbell et al. (2010)
performed a meta-analysis on 26 studies examining the effects of personality as a predictor of
pilot training completion (pass-fail). Two higher order personality domains, Neuroticism (r =
-.15) and Extraversion (r = .13), and one lower order facet of Neuroticism, Anxiety (r = -.11),
were found to have an impact on training success. After correction for range restriction and
reliability of the predictors, the correlations were -.25 for Neuroticism, .17 for Extraversion, and
-.14 for Anxiety. The authors concluded that emotionally stable, extroverted individuals would
be better able to undergo the stress of aviation training.

Finally, Chidester, Helmreich, Gregorich, and Geis (1991) examined the relations between
personality and crew coordination training performance in two samples of military pilots.
Three profiles were identified through cluster analysis of the personality scales: Positive
Instrumental/Expressive, Negative Instrumental, and Low Motivation. These clusters replicated
across samples and predicted attitude change following crew coordination training.


Purposes 

The purpose of this study was to examine the predictive validity of cognitive ability and
personality traits for pilot training performance. We also examined the consistency of pilot trainee
cognitive ability, personality traits, and training success across three training tracks, four training
bases, and over a 14-year period. Maintaining a consistently high level of pilot trainee quality and
training performance over time is crucial to ensuring an effective operational pilot cadre. Details
regarding the predictor and criterion measures are provided in the Method section. Because
consistency is vital to training success, fewer statistical differences are evidence of greater
consistency and stability of the training system. To begin, we examined whether there were mean score
differences in the cognitive, personality, and criterion scores across the training tracks, bases, and
time period. Further, we examined the predictive validity of the cognitive and personality scores
for pilot training performance. Here, consistency of prediction across tracks, bases, and time is
important, as well as consistency with previous studies relating cognitive ability and personality
to pilot training performance.


METHOD 


Participants 

A sample of 9,641 individuals selected for pilot training was administered the MAB and the NEO
PI-R prior to beginning the 53-week SUPT program. All participants were college graduates or
were  near  completion  of  college  at  time  of  testing.  Selection  ratios  for  pilot  training  assignments 
vary  from  year  to  year  as  a  function  of  the  number  of  applicants  and  the  number  of  training 
positions  available  for  each  commissioning  source.  Of  the  participants  reporting  demographic 
information  (98.5%),  all  were  under  the  age  of  36  years,  with  a  modal  age  of  22  years,  mean 
age  of  24  years,  and  standard  deviation  of  2.6  years.  Most  of  the  participants  (93%)  were  men. 
Racial  and  ethnic  distributions  indicated  that  91%  were  White,  2%  were  African  American,  3% 
were  Hispanic,  and  4%  were  other.  All  were  tested  at  either  the  School  of  Aerospace  Medicine  at 
Brooks  City-Base,  TX,  or  at  the  USAF  Academy  in  Colorado  Springs,  CO. 

Measures 

Multidimensional Aptitude Battery. The MAB (Jackson, 2003) is a broad-based test of
intellectual ability patterned after the Wechsler Adult Intelligence Scale-Revised (WAIS-R;
Wechsler, 1981). The MAB has 10 subtests that are combined to produce three summary scores:
verbal IQ (VIQ), performance IQ (PIQ), and full-scale IQ (FSIQ). Previous research has
demonstrated that the FSIQ scores for the MAB and WAIS-R are strongly correlated (r = .91; Conoley
& Kramer, 1989) and that the MAB measures general mental ability in several age groups
(Wallbrown, Carmin, & Barnett, 1988). The MAB requires less than 1.5 hr to administer and
can be individually or group administered. The subtests each have a normative mean of 50 (SD =
10). FSIQ, VIQ, and PIQ scores have a mean of 100 (SD = 15) in the general population. MAB
norms are based on a sampling of nine age groups that were diverse in terms of gender, ethnicity,
race, and North American (Canada and United States) geographic location. Test-retest
reliability for the IQ scores ranges from .94 to .98 (Jackson, 2003) for an average retest interval of
45 days.

Table 1 provides brief descriptions and reliability of the subtests and indicates the summary IQ
scores to which they contribute. Internal consistency reliability of the MAB-II in a sample of 91
20-year-olds was estimated using KR-20 (Jackson, 2003). This age group was the most similar
to our participants. Reliabilities of the IQ scores ranged from .97 to .98 and reliabilities of the
subtests ranged from .80 to .96.

TABLE 1
Multidimensional Aptitude Battery (MAB-II) Subtest and Summary Score Descriptions and Internal
Consistency Reliabilities

Scale      Subtest              Description                                                   Reliability
VIQ, FSIQ  Information          Assesses the extent to which an individual has acquired       .87
                                knowledge about diverse topics
VIQ, FSIQ  Comprehension        Measures the ability to evaluate social behavior, identify    .88
                                behavior that is more socially acceptable, and provide
                                reasons why certain social customs and laws are practiced
VIQ, FSIQ  Arithmetic           Assesses reasoning and problem-solving ability through the    .80
                                solution of numerical problems
VIQ, FSIQ  Similarities         Assesses the ability to conceptualize properties of an        .90
                                object and to compare them to those of another object,
                                identifying the most similar characteristic
VIQ, FSIQ  Vocabulary           Measures the ability to identify word meaning                 .88
PIQ, FSIQ  Digit Symbol         Assesses visual-motor activity in substituting symbols for    .95
                                digits
PIQ, FSIQ  Picture Completion   Measures the ability to identify missing elements in a        .88
                                picture
PIQ, FSIQ  Spatial              Assesses the ability to visualize abstract objects in         .96
                                different positions in two-dimensional space
PIQ, FSIQ  Picture Arrangement  Assesses the ability to arrange a set of randomly ordered     .85
                                pictures into a meaningful sequence
PIQ, FSIQ  Object Assembly      Measures the ability to identify a complete object from       .89
                                disassembled parts

Note. Reliability was estimated through internal consistency using KR-20 (Jackson, 2003). VIQ = verbal IQ;
FSIQ = full-scale IQ; PIQ = performance IQ.


NEO PI-R. The NEO PI-R (Costa & McCrae, 1985) was designed to measure the Big
Five personality domains and the facets or traits that underlie each domain. The five domains
are Neuroticism, Extraversion, Openness to Experience, Agreeableness, and Conscientiousness.
Each domain consists of six subscales called facet scores. These domains and facets provide a
comprehensive measurement of adult personality.

The NEO PI-R was developed with the goal of being a multipurpose personality inventory
useful for predicting many criteria, such as behaviors related to illness, career interests,
psychological health, and styles of coping (Costa & McCrae, 1985). It contains 240 statements that
require examinees to respond on a Likert-type scale, ranging from 1 (strongly disagree) to 5
(strongly agree). Table 2 provides a description of the five domain scales as well as their internal
consistency reliabilities (coefficient alpha) in a sample of 1,539 men and women in a large
organization. Reliability coefficients for the 30 facets are reported in the test manual and range from
.56 to .81 (Costa & McCrae, 1985). For this study, the normative sample for adults served as the
normative reference and the test was administered and scored via computer (Costa & McCrae,
1985).

TABLE 2
NEO PI-R Domain Definitions and Internal Consistency Reliabilities

Test                         Definition                                                            Reliability
Neuroticism (N)              The tendency to experience negative emotions (anger, sadness, fear)   .92
                             and be emotionally unstable
Extraversion (E)             The enjoyment of social situations, excitement, and stimulation       .89
Openness to Experience (O)   A willingness to explore new ideas and values; desire for aesthetics  .87
Agreeableness (A)            The desire to sympathize with and help others                         .86
Conscientiousness (C)        Seeking a high level of organization and planning; the tendency to    .90
                             plan carefully and exercise self-discipline

Note. Reliability was estimated through internal consistency using coefficient alpha for a developmental sample of
1,539 respondents (Costa & McCrae, 1985).


Training performance criterion. SUPT consists of a primary aircraft training phase and an
advanced aircraft training phase. Primary aircraft training (T-6) consists of about 90 hr of flight
training instruction over 22 weeks. The purpose is to teach basic flying skills including contact,
instruments, formation (2-ship), and navigation. At the end of this phase, students are assigned
to advanced training in either the fighter/bomber or the airlift/tanker track. Advanced training
track assignments are a function of student preferences, training performance, instructor ratings,
and aircraft availability. The fighter/bomber advanced training track (T-38) includes about 120 hr
of flight instruction over 24 weeks designed to prepare students for follow-on fighter/bomber
training assignments. The initial training focus is on contact, instruments, formation (2/4 ship),
navigation, and low-level flight. The airlift/tanker advanced training track (T-1) has about 115 hr
of flight instruction over 26 weeks. The purpose is to prepare students for assignments to
multiengine jet and turboprop aircraft. The training focuses on transition, instruments, navigation,
low-level flight, and formation. It should be noted that training at Sheppard AFB differs from
that at the other three bases. Sheppard AFB hosts the ENJJPT program, which is focused on
training of combat pilots. It has no airlift/tanker advanced training track. Also, ENJJPT students
receive more flying hours in both primary (125 hr over 26 weeks) and Advanced T-38 (135 hr
over 26 weeks) training than those attending SUPT.

The C-Score is a standardized flying training performance criterion measure developed by
AETC to provide compatibility and comparability of performance at all U.S. Air Force pilot
training bases. The C-Score was developed after it was determined that there were mean differences
in the ratings and other measures of pilot training performance across bases. For example, a very
high-scoring pilot at Base A might be scored lower than a high-scoring pilot at Base B, due to
idiosyncratic rating behavior by an instructor, check ride raters, or both. As a result, comparisons
across bases from one pilot training class to another were uncertain.

To enable meaningful comparisons (base-to-base, class-to-class, year-to-year, and
pilot-to-pilot), the C-Score is a percentile rank based on a 2-year moving average. This allows the C-Score
to reflect the training performance of each pilot, relative to the previous 2 years of training
performance for all pilots. Using past pilot performance as a moving baseline average
produces more reliable, stable, and interpretable scores, permitting distinctions between individual
performances.

The C-Score uses daily flying grades and check flight grades weighted approximately 1 to
2 in favor of the check flights. Daily flying grades include instructor pilots’ evaluations of a pilot
trainee’s performance on all flights other than check flights. Daily flying grades are a weighted
average of all flying training procedures and maneuvers performed during a flight and are rated
unsatisfactory, fair, good, or excellent. In addition to daily flights, during training, pilot trainees
must pass a check flight for each course of instruction. As with daily flying grades, check flight
grades are a weighted average of ratings of flying procedures and maneuvers, which can have
values of unsatisfactory, fair, good, or excellent. Maneuver grade-point values are weighted
based on the importance of the maneuver.
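
The grade aggregation described above can be illustrated with a minimal sketch. The numeric values assigned to the four rating categories, the maneuver weights, and the function names are assumptions made for illustration only; the source gives only the approximate 1-to-2 weighting of daily grades to check flight grades and states that maneuvers are weighted by importance.

```python
# Assumed numeric values for the four rating categories (not given in the source).
GRADE_POINTS = {"unsatisfactory": 0.0, "fair": 1.0, "good": 2.0, "excellent": 3.0}


def flight_grade(maneuvers):
    """Importance-weighted average grade for one flight.

    maneuvers: list of (rating, importance_weight) pairs; weights are hypothetical.
    """
    total_weight = sum(weight for _, weight in maneuvers)
    weighted_sum = sum(GRADE_POINTS[rating] * weight for rating, weight in maneuvers)
    return weighted_sum / total_weight


def raw_composite(daily_flights, check_flights, daily_weight=1.0, check_weight=2.0):
    """Combine mean daily and mean check flight grades, weighted about 1 to 2."""
    daily_mean = sum(flight_grade(f) for f in daily_flights) / len(daily_flights)
    check_mean = sum(flight_grade(f) for f in check_flights) / len(check_flights)
    return (daily_weight * daily_mean + check_weight * check_mean) / (
        daily_weight + check_weight
    )


# Hypothetical trainee: one daily flight with two maneuvers, one check flight.
daily = [[("good", 1.0), ("excellent", 1.0)]]
checks = [[("fair", 1.0)]]
composite = raw_composite(daily, checks)
```

Applying the 1-to-2 weighting to the mean daily and mean check grades is one plausible reading of the description; the operational AETC computation may differ in detail.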

The  C-Score  calculation  is  standardized  against  approximately  200  previous  students  at  that 
particular  base  or  2  years  of  students,  whichever  is  greater.  The  calculations  for  each  class  are 
based  on  a  moving  average,  as  one  class  is  added  to  the  population  when  the  oldest  class  is 
eliminated  from  the  population.  The  C-Score  is  calculated  for  each  class.  Students  are  ranked  on 
their  C-Score  value  and  each  student  is  given  a  C-Score  percentile  rank,  a  number  between  0% 
and  100%.  The  C-Score  and  percentile  rank  for  a  student  are  only  recorded  when  the  student  is 
part  of  the  graduating  class. 
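
A rough sketch of the moving-baseline percentile rank follows. It assumes a mid-rank treatment of ties and a baseline defined by a fixed number of retained classes; both are simplifications introduced here, since the source describes the baseline as roughly 200 prior students or 2 years of students, whichever is greater. The class and function names are hypothetical.

```python
from bisect import bisect_left, bisect_right
from collections import deque


class MovingBaseline:
    """Holds the most recent classes; adding a class drops the oldest one."""

    def __init__(self, max_classes):
        # max_classes is an assumed stand-in for "2 years of students".
        self.classes = deque(maxlen=max_classes)

    def add_class(self, scores):
        self.classes.append(list(scores))

    def scores(self):
        return [score for cls in self.classes for score in cls]


def percentile_rank(score, baseline_scores):
    """Percentile rank (0 to 100) of score within the baseline, ties at mid-rank."""
    ordered = sorted(baseline_scores)
    below = bisect_left(ordered, score)
    equal = bisect_right(ordered, score) - below
    return 100.0 * (below + 0.5 * equal) / len(ordered)


# Hypothetical raw composites for three successive classes.
baseline = MovingBaseline(max_classes=2)
baseline.add_class([1.0, 2.0])
baseline.add_class([3.0, 4.0])
baseline.add_class([5.0, 6.0])  # the oldest class [1.0, 2.0] is dropped
rank = percentile_rank(5.0, baseline.scores())
```

Because each trainee is ranked against recent cohorts rather than only against classmates, a strong class and a weak class are placed on the same scale.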


Analyses 


Analyses  were  conducted  by  training  track,  base,  and  year.  Three  analyses  were  conducted  for 
each  of  these  sets.  First,  descriptive  statistics  were  calculated  for  the  MAB  IQ  Scores,  NEO  PI-R 
domain  scores,  and  C-Score  percentile  rank.  Second,  analyses  (t  tests  or  one-way  analyses  of 
variance  [ANOVAs])  were  conducted  to  determine  statistical  differences  in  mean  scores  for  each 
variable  in  each  category.  Third,  correlational  analyses  were  conducted  to  determine  how  well 
the  MAB  and  NEO  PI-R  scores  predicted  C-Score  percentile  rank. 

Three sets of correlations were examined: observed (uncorrected) correlations, correlations
corrected for range restriction, and correlations corrected for both range restriction and reliability
of the scores. The assumptions underlying range restriction correction are the same as two of
the three assumptions underlying the computation of a Pearson product-moment correlation:
linearity of form and homoscedasticity. If the assumptions are met to estimate the correlation
coefficient, they also are met to compute the correction. Restriction of range generally causes
statistical indexes to underestimate true values. The multivariate correction method (Lawley, 1943)
was used for the MAB-II scores. The univariate Case II correction (Thorndike, 1949) was used
for the NEO PI-R scores due to a lack of sufficient data to apply the multivariate method. The
normative sample of the MAB-II and NEO PI-R provided the means, standard deviations, and
correlations used for the correction. The corrected means, standard deviations, and correlations
are superior estimates of the population values compared to the uncorrected values. This method
removes the bias from the uncorrected sample estimates.
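
The univariate Case II correction for direct range restriction can be sketched as follows. The formula is the standard one, r_c = rU / sqrt(1 + r²(U² - 1)), where U is the ratio of the unrestricted to the restricted predictor standard deviation. The input values below are purely illustrative and are not results from this study; the multivariate (Lawley) method used for the MAB-II scores is not shown.

```python
from math import sqrt


def correct_case2(r_restricted, sd_unrestricted, sd_restricted):
    """Thorndike Case II correction for direct range restriction on the predictor."""
    u = sd_unrestricted / sd_restricted
    return (r_restricted * u) / sqrt(1.0 + r_restricted**2 * (u**2 - 1.0))


# Illustrative values only: an observed r of .10 in a sample whose predictor SD
# is 6.5, against a normative SD of 15.
r_corrected = correct_case2(0.10, sd_unrestricted=15.0, sd_restricted=6.5)
```

When the sample is as variable as the normative group (U = 1), the function returns the observed correlation unchanged, and the size of the adjustment grows as the sample becomes more restricted.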

The range-restriction corrected correlations were then corrected for reliability (Hunter &
Schmidt, 2004) of the test scores and training criterion:

$$ r_c = \frac{r_{xy}}{\sqrt{r_{xx}\, r_{yy}}} $$

where $r_{xy}$ is the range-restriction corrected correlation, and $r_{xx}$ and $r_{yy}$ are the reliabilities of the
test score and the criterion, respectively. Correlations
were corrected for the reliability of both the test score and criterion because we were
interested in the theoretical constructs underlying the measures, not the specific measures themselves.
This third set of correlations provides a theoretical estimate of the validities of the underlying
constructs when perfectly reliable measures are available.
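
The correction for unreliability in both the predictor and the criterion divides the correlation by the square root of the product of the two reliabilities. A minimal sketch is given below; the criterion reliability and the input correlation are hypothetical values chosen for illustration, as neither is reported in this section.

```python
from math import sqrt


def correct_for_unreliability(r_xy, rel_x, rel_y):
    """Disattenuate a correlation for unreliability in predictor and criterion."""
    if not (0.0 < rel_x <= 1.0 and 0.0 < rel_y <= 1.0):
        raise ValueError("reliabilities must lie in (0, 1]")
    return r_xy / sqrt(rel_x * rel_y)


# Hypothetical inputs: a range-restriction corrected r of .20, a predictor
# reliability of .97, and an assumed criterion reliability of .80.
r_construct = correct_for_unreliability(0.20, rel_x=0.97, rel_y=0.80)
```

With perfectly reliable measures the correlation is unchanged; lower reliabilities produce larger upward adjustments, which is why the result is best read as a construct-level estimate rather than an operational validity.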

Sample sizes differ for each analysis and are noted below each table. All analyses used a
one-tailed test. The analyses that involved year-to-year comparisons used a .01 Type I error rate due
to the large number of comparisons. All other analyses used a .05 Type I error rate. It should be
noted that although the very large samples used in this study ensure sufficient statistical power,
very small differences will be statistically significant yet might offer little practical predictive
power. Although we report statistical significance, because of the large samples involved, we
focus on effect size (d, r). Importantly, fewer statistical differences (small effect sizes) across
training tracks, bases, and years are desirable, as this indicates greater stability and consistency
in the measures.
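
The point about large samples can be made concrete by computing the smallest correlation that reaches one-tailed significance at a given sample size. The sketch below uses the relation r = t / sqrt(t² + df) and substitutes the normal critical value for the t critical value, an approximation that is reasonable only for large samples. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def smallest_significant_r(n, alpha=0.05):
    """Approximate smallest r that is significant in a one-tailed test.

    Uses the normal critical value in place of the t critical value,
    which is adequate when n is large.
    """
    z = NormalDist().inv_cdf(1.0 - alpha)
    df = n - 2
    return z / sqrt(z * z + df)


# With the full sample of 9,641 trainees, very small correlations are significant.
r_min_large = smallest_significant_r(9641)
r_min_small = smallest_significant_r(100)
```

At this sample size, correlations below .02 reach significance at the .05 level, which is why effect sizes carry more interpretive weight here than significance tests.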


RESULTS  AND  DISCUSSION 

The predictive validity of cognitive ability and personality was examined in large samples of
USAF pilot trainees by training track and training location for a 14-year period. Consistency
in pilot aptitude and training outcomes was also examined. Validity results were consistent with
previous findings that cognitive ability is the best predictor of pilot training performance (Carretta
& Ree, 2003; Ree & Carretta, 1996).

Analyses  by  Training  Track 

The  first  set  of  analyses  was  conducted  by  training  track:  primary,  advanced  T-38,  and  advanced 
T-1.  Data  were  collapsed  across  training  bases  and  years  for  these  analyses. 

Means.  Descriptive  statistics  are  shown  in  Table  3.  The  MAB  IQ  scores  for  the  student  pilots 
were  severely  range  restricted  compared  to  the  normative  values  (M  =  100,  SD  =  15).  The  IQs 
for  each  of  the  training  groups  were  high  at  about  120  (about  1.33  SD  above  the  normative  mean) 
and  the  variances  of  the  scores  were  much  less  than  the  normative  values.  For  the  FSIQ  score,  the 
variance  for  the  trainees  was  about  18%  of  the  normative  value. 
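
The two range-restriction figures quoted above follow from simple arithmetic on the normative values. The sketch below reproduces them; the helper names are hypothetical, and the inputs are the primary-track FSIQ mean and standard deviation reported in Table 3.

```python
NORM_MEAN = 100.0  # normative IQ mean stated in the text
NORM_SD = 15.0     # normative IQ standard deviation stated in the text


def sd_units_above_norm(sample_mean: float) -> float:
    """Distance of a sample mean from the normative mean, in normative SDs."""
    return (sample_mean - NORM_MEAN) / NORM_SD


def variance_ratio(sample_sd: float) -> float:
    """Sample variance expressed as a fraction of the normative variance."""
    return (sample_sd / NORM_SD) ** 2


# Primary-track FSIQ values from Table 3: M = 120.58, SD = 6.50.
print(f"SDs above the normative mean: {sd_units_above_norm(120.58):.2f}")
print(f"variance relative to norm:    {variance_ratio(6.50):.1%}")
```

With the rounded mean of 120 used in the text, the first quantity is 20/15, or about 1.33 SD.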


TABLE 3

Descriptive Statistics for Primary and Advanced Training Tracks

              Primary           Advanced T-38     Advanced T-1      T-38 vs. T-1
Score         M        SD       M        SD       M        SD       d        t
C-Score       0.52     0.29     0.49     0.29     0.54     0.29     -0.17    -5.56**
VIQ           119.03   6.57     120.18   6.31     118.19   6.35     0.31     10.15**
PIQ           119.41   8.17     120.62   7.90     120.27   7.75     0.04     1.43
FSIQ          120.58   6.50     121.83   6.29     120.56   6.17     0.20     6.55**
N             46.65    9.37     46.07    9.46     46.29    9.17     -0.02    -0.79
E             57.59    9.56     58.12    9.65     57.65    9.47     0.05     1.58
O             50.67    10.18    50.49    10.39    50.05    9.66     0.04     1.39
A             43.81    10.56    42.73    10.66    44.38    10.28    -0.15    -5.12**
C             54.73    10.17    55.49    10.03    55.60    9.86     -0.01    -0.32

Note. Primary N = 9,396; advanced T-38 N = 3,295; advanced T-1 N = 1,524. VIQ = verbal IQ; PIQ = performance
IQ; FSIQ = full-scale IQ; N = Neuroticism; E = Extraversion; O = Openness to Experience; A = Agreeableness;
C = Conscientiousness.

*p < .05. **p < .001.

The mean score differences between those assigned to the fighter/bomber and airlift/tanker
tracks  were  small  (1.27  points  for  the  FSIQ  or  .20  d).  The  finding  of  slightly  higher  cognitive 
ability  scores  for  fighter/bomber  trainees  is  consistent  with  the  selection  and  assignment  of  pilots 
for  advanced  training  and  with  prior  studies  (Boyd  et  al.,  2005).  Because  the  T-38  track  leads  to 
more  preferred  assignments  in  fighter/bomber  aircraft,  students  with  higher  cognitive  ability  tend 
to  be  assigned  to  this  track. 

The results were mixed for the NEO PI-R, where trainees were above the normative
mean score of 50 for Extraversion and Conscientiousness and below the normative mean for
Neuroticism and Agreeableness. For example, pilots score lower on Agreeableness than the general
population (King, Barto, Ree, & Teachout, 2011). The lower mean for Agreeableness for
trainees assigned to the fighter/bomber (T-38) advanced training track (T-38 = 42.70, T-1 =
44.38, d = -.15) was consistent with previous results on personality for the highly selected pilot
population.

Independent  groups  t  tests  were  conducted  on  each  of  the  nine  variables  to  identify  significant 
differences  between  the  two  advanced  training  tracks.  Because  the  advanced  tracks  include  the 
students  from  the  primary  track,  no  comparisons  were  made  with  the  primary  track.  Results 
indicated  that  there  were  small  but  statistically  significant  mean  differences  between  the  T-38  and 
T-1  advanced  tracks  for  four  of  the  nine  scores.  Cohen  (1988)  characterized  standardized  mean 
differences (d) of .2 as small, .5 as medium, and .8 or greater as large. All mean score differences
between trainees in the T-38 and T-1 tracks were small. T-38 trainees scored higher on the MAB
VIQ (d = .31) and FSIQ (d = .20) scores than did T-1 trainees. However, T-38 trainees scored
lower on the NEO PI-R Agreeableness score (d = -.16) and the C-Score (d = -.17) than those
in  the  T-1  track. 
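
The standardized mean difference and the independent groups t statistic can both be recovered from the summary statistics in Table 3. The sketch below assumes a pooled standard deviation and equal variances, which the text does not state explicitly, and the function names are hypothetical. It uses the VIQ row as the worked case.

```python
import math


def pooled_sd(sd1: float, n1: int, sd2: float, n2: int) -> float:
    """Pooled standard deviation of two independent groups."""
    return math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))


def cohens_d(m1, sd1, n1, m2, sd2, n2) -> float:
    """Standardized mean difference (group 1 minus group 2)."""
    return (m1 - m2) / pooled_sd(sd1, n1, sd2, n2)


def t_statistic(m1, sd1, n1, m2, sd2, n2) -> float:
    """Equal-variance independent groups t statistic from summary data."""
    standard_error = pooled_sd(sd1, n1, sd2, n2) * math.sqrt(1 / n1 + 1 / n2)
    return (m1 - m2) / standard_error


# VIQ row of Table 3: advanced T-38 (group 1) vs. advanced T-1 (group 2).
viq = (120.18, 6.31, 3295, 118.19, 6.35, 1524)
print(f"d = {cohens_d(*viq):.2f}, t = {t_statistic(*viq):.2f}")
```

Small discrepancies from the tabled t values are expected because the published means and standard deviations are rounded to two decimals.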

Correlations. Table 4 summarizes the correlational analyses by training track. All of the
MAB IQ correlations with the C-Score were statistically significant for each training phase. Eight
of the 15 correlations between the NEO PI-R scores and the C-Score were statistically significant.
Cohen (1988) characterized correlations of .10 as small, .30 as medium, and .50 or greater
as large. All of the observed correlations between the MAB-II and NEO PI-R with the C-Score
criterion were small. Even after correction for range restriction and reliability, only 6 of the
24 correlations with the C-Score exceeded .30. These were for the MAB-II scores and C-Scores
for the primary (T-6) and advanced T-38 tracks. Overall, the magnitudes of the correlations were
higher for cognitive ability (MAB) than for personality traits (NEO PI-R).


TABLE 4

Observed and Corrected Correlations of Multidimensional Aptitude Battery (MAB-II) IQ Scores and NEO
PI-R Domain Scores With C-Score Percentile Rank by Training Track

          Primary                    Advanced T-38              Advanced T-1
Score     r         rc      rfc      r         rc      rfc      r         rc      rfc
VIQ       .092**    .245    .321     .095**    .247    .324     .102**    .198    .260
PIQ       .117**    .266    .348     .115**    .275    .361     .056*     .150    .196
FSIQ      .126**    .288    .377     .126**    .295    .386     .098**    .197    .258
N         -.023*    -.040   -.054    .014      -.020   -.027    .020      -.140   -.188
E         .008      -.060   -.082    .038*     -.050   -.068    -.002     -.090   -.123
O         -.064**   .050    .069     -.067**   .070    .097     -.042*    .060    .083
A         -.019*    -.030   -.042    -.059**   -.060   -.083    .029      .030    .041
C         .031**    .000    .000     .043*     .020    .027     .107**    .070    .095

Note. Sample sizes were primary N = 9,396; advanced T-38 N = 3,295; advanced T-1 N = 1,524. Correlations in
the column labeled r were observed (uncorrected). Those in the column labeled rc were corrected for range restriction
and those in the column labeled rfc were corrected for range restriction and reliability of the scores. The MAB IQ scores
were corrected using the multivariate method (Lawley, 1943), whereas the NEO domain scores were corrected using
the univariate Case 2 (Thorndike, 1949) method. VIQ = verbal IQ; PIQ = performance IQ; FSIQ = full-scale IQ; N =
Neuroticism; E = Extraversion; O = Openness to Experience; A = Agreeableness; C = Conscientiousness.

*p < .05. **p < .001.
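
The count of corrected correlations exceeding Cohen's medium benchmark can be checked directly against the fully corrected (rfc) values in Table 4. The sketch below is only a tally of the tabled values; the variable names are hypothetical.

```python
# Fully corrected correlations (rfc) from Table 4, ordered as
# (primary, advanced T-38, advanced T-1).
R_FC = {
    "VIQ": (0.321, 0.324, 0.260),
    "PIQ": (0.348, 0.361, 0.196),
    "FSIQ": (0.377, 0.386, 0.258),
    "N": (-0.054, -0.027, -0.188),
    "E": (-0.082, -0.068, -0.123),
    "O": (0.069, 0.097, 0.083),
    "A": (-0.042, -0.083, 0.041),
    "C": (0.000, 0.027, 0.095),
}
MEDIUM = 0.30  # Cohen's (1988) benchmark for a medium correlation

all_values = [r for row in R_FC.values() for r in row]
above = [(score, r) for score, row in R_FC.items() for r in row if abs(r) > MEDIUM]

print(f"{len(above)} of {len(all_values)} corrected correlations exceed {MEDIUM}")
print(sorted({score for score, _ in above}))
```

Every value above the benchmark belongs to a MAB score in the primary or T-38 column; no NEO PI-R value comes close.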

The  overall  correlational  results  for  training  tracks  indicated  that  cognitive  ability  was  related 
to  pilot  training  success  for  all  three  tracks,  and  these  correlations  were  higher  than  those  for 
the  personality  trait  measures.  Small  differences  in  the  magnitude  of  validities  of  the  cognitive 
test  scores  by  training  track  were  observed  with  lower  values  for  T-1  training.  For  example,  after 
correction  for  range  restriction  and  reliability  of  the  measures,  the  MAB  FSIQ  score  validities 
were .377 for primary (T-6), .386 for advanced fighter/bomber (T-38), and .258 for advanced
airlift/tanker (T-1) training. The reason for these differences is unknown; however, they might be
due  to  differing  rater  accuracy  among  other  factors. 

Analyses  by  Base 

The second set of analyses was conducted by base (Columbus, Laughlin, Sheppard,^ and Vance),
for primary, advanced T-38, and advanced T-1 training. Due to space limitations, the tables
summarizing these analyses cannot be presented here. Interested readers should consult Teachout
et al. (2013).

Means 

Descriptive  statistics  were  calculated  for  each  of  the  nine  variables.  One-way  ANOVAs  were 
conducted to identify any statistically significant differences among the bases for primary, T-38,
and  T-1  training. 

Primary training. The sample sizes by base for primary training ranged from 1,023 to 2,781.
Results indicated that there were small (Cohen, 1988) but statistically significant mean score
differences between bases for six variables. Sheppard AFB differed from the other bases with
primary trainees about 2 points higher on all three MAB IQ scores. The standardized mean
difference (d) on the FSIQ score between Sheppard and the other bases ranged from .33 to .39.
Further, trainees at Sheppard were significantly lower on Agreeableness (about 1 point or .10 d)
and higher on Conscientiousness (about 3 points or .31 d) than trainees at the other bases. These
results could be due to the selectiveness of the ENJJPT program.

Advanced  T-38  training.  The  sample  sizes  by  base  for  Advanced  T-38  training  ranged  from 
650 to 1,006. There were small but statistically significant mean differences among the bases. The
differences  were  between  Sheppard  and  one  or  more  of  the  other  bases,  paralleling  the  results  for 
primary  training.  The  MAB  scores  at  Sheppard  were  higher  than  for  the  other  bases. 


^Sheppard AFB, which hosts the combat-oriented ENJJPT program, does not have an advanced T-1 training track.

Advanced T-1 training. The sample sizes by base for Advanced T-1 training ranged from
351 to 589. There is no T-1 track at Sheppard AFB. Results indicated that there were small but
statistically significant mean differences among bases for only the C-Score. The C-Score for
Columbus was significantly lower than that for Laughlin (d = -.28) and Vance (d = -.30).


Correlations

The  pattern  of  correlations  between  the  MAB  and  NEO  PI-R  scores  and  C-Score  by  base  was 
similar  to  those  observed  when  the  data  were  collapsed  across  bases  (see  Table  4). 


Primary training. For each base, all three MAB IQ scores demonstrated small but statistically
significant relations to the C-Score. For example, the correlations between the MAB FSIQ
and C-Score ranged from .380 to .428 after correction for range restriction and reliability. The
relations between the NEO PI-R scores and the C-Score were weaker than those for the MAB.
Only 7 of 20 correlations were statistically significant.


Advanced T-38 training. Validities of the test scores for predicting T-38 training performance
were generally lower and less consistent than those for primary training. The correlation
between the MAB FSIQ and C-Scores ranged from .170 to .458 after correction for range restriction
and reliability. As with primary training, the correlations between the NEO PI-R scores and
C-Score were weaker than those for the MAB, with only 7 of 20 NEO PI-R/C-Score correlations
being statistically significant. Three of the seven statistically significant correlations were
for Openness.


Advanced  T-1  training.  As  with  T-38  training,  results  for  T-1  training  were  less  consistent 
than those for primary training. The correlations between the MAB FSIQ and C-Scores ranged
from  .175  to  .406  after  correction  for  range  restriction  and  reliability.  The  MAB  PIQ  score  was 
not  related  to  training  performance  for  T-1  training. 

Overall,  the  magnitude  of  the  correlations  was  higher  for  cognitive  ability  (MAB)  compared 
to  personality  traits  (NEO  PI-R).  Only  2  of  the  15  correlations  between  the  NEO  PI-R  scores  and 
the  C-Score  were  statistically  significant.  Both  were  for  Conscientiousness  at  Laughlin  (.057)  and 
Vance  (.317)  after  correction  for  range  restriction  and  reliability. 

The  most  consistent  result  for  comparisons  of  trainee  quality  across  training  bases  was  that 
Sheppard  AFB  had  higher  quality  pilot  trainees  based  on  higher  cognitive  ability  scores  and 
higher  scores  on  Conscientiousness,  a  key  personality  trait  predictive  of  success  in  all  jobs 
(Barrick  &  Mount,  1991).  These  pilots  also  were  lower  on  Agreeableness.  Further  examination 
of  student  assignment  to  different  bases  is  warranted  to  understand  these  differences.  We  can 
only  speculate  as  to  the  underlying  cause  of  these  relations.  Sheppard  AFB  is  where  the  combat- 
oriented  ENJJPT  program  is  located.  There  is  no  separate  advanced  training  track  for  nonfighter 
pilots.  As  a  result,  it  is  likely  that  pilot  candidates  who  are  considered  to  have  a  high  probability 
of becoming fighter-qualified are assigned to ENJJPT.

Analyses by Year

The  third  set  of  analyses  was  conducted  by  year  (1995-2008)  for  each  training  phase.  Due 
to  space  limitations,  the  tables  summarizing  these  analyses  cannot  be  presented  here  but  are 
available  elsewhere  (Teachout  et  al.,  2013). 

Means 

A one-way ANOVA was conducted on each of the nine scores to determine statistically significant
differences among the 14 years for each training phase. The numerous comparisons for these
analyses (91 comparisons for each of 9 scores for each phase = 819 comparisons/phase) should
be viewed with caution due to the increased likelihood of Type I error; that is, finding significant
differences by chance as the number of comparisons increases. For this reason, a p < .01 level
of significance was used for comparing these mean differences. Further, rather than reporting
and interpreting all of the significant differences, we focused on data trends. As described in
what follows, most of the statistically significant mean score differences occurred for primary
training. It is likely that primary training attrition and the advanced training assignment process
contributed to making the advanced training groups less variable.
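
The comparison counts quoted above, and the reason for the stricter significance level, can be illustrated with a short calculation. The helper names below are hypothetical. The expected number of chance findings assumes that every null hypothesis is true and that each test is run at the stated alpha; it is an illustration, not an analysis reported in the study.

```python
from math import comb


def pairwise_comparisons(n_groups: int) -> int:
    """Number of distinct pairs among n_groups group means."""
    return comb(n_groups, 2)


def expected_chance_findings(n_tests: int, alpha: float) -> float:
    """Expected count of significant results if every null hypothesis holds."""
    return n_tests * alpha


per_score = pairwise_comparisons(14)  # 14 training years, 1995 to 2008
per_phase = per_score * 9             # nine scores per training phase

print(f"comparisons per score: {per_score}")
print(f"comparisons per phase: {per_phase}")
print(f"expected by chance at .01: {expected_chance_findings(per_phase, 0.01):.2f}")
print(f"expected by chance at .05: {expected_chance_findings(per_phase, 0.05):.2f}")
```

The expectation is additive across tests, so it does not require the comparisons to be independent of one another.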

Primary training. Results indicated there were statistically significant differences for eight
of the nine scores for primary training. Overall, although there were some statistically significant
differences (75/819 = 9.2%), the scores were very stable, indicating that the characteristics and
quality of pilot trainees were consistent over time. Further, all of the effect sizes were small. The
number of significant differences was largest for the MAB PIQ score (22/91 = 24.2%) and
C-Score (13/91 = 14.3%; see Table 5). Sixteen of the 22 significant differences for PIQ were for
years 2001 to 2003, where the PIQ scores were lower than for other years. For the C-Score, the
mean for 1997 was higher than that for 1999 and 2003 to 2006 and the mean for 2002 was higher
than those for 2003 to 2006.


TABLE 5

Number of Statistically Significant Mean Score Differences Across Years

                      Training Phase
Score                 Primary    Advanced T-38    Advanced T-1
C-Score               13         11               3
VIQ                   3          0                0
PIQ                   22         7                0
FSIQ                  5          0                0
Neuroticism           9          1                0
Extraversion          0          2                0
Openness              2          0                0
Agreeableness         11         3                1
Conscientiousness     10         3                0
Total                 75         27               4

Note. The numbers indicate the number of statistically significant mean score differences at the
p < .01 level. VIQ = verbal IQ; PIQ = performance IQ; FSIQ = full-scale IQ.

Advanced T-38 training. The degree of consistency in mean scores was greater for
advanced training than for primary training. Six of the nine scores exhibited significant differences
for T-38 training. Only 3% (27/819) of the comparisons reached statistical significance.
As with primary training, all of the effect sizes were small and most of the significant differences
occurred for the C-Score (11) and MAB PIQ (7). For the C-Score, 10 of the 11 differences
occurred for 2000 and 2001, which were lower than other years. The MAB PIQ scores for
2005 and 2006 were higher than those for 2000 to 2003.

Advanced T-1 training.^ Only two of nine scores showed statistically significant mean
score differences across year of training. Only 7.4% of the comparisons (4/54) were statistically
significant, indicating a remarkable degree of consistency in scores for the T-1 trainees.

Correlations 

The  correlational  results  broken  out  by  year  of  training  were  consistent  with  those  reported 
earlier  where  the  data  were  collapsed  across  years.  Overall,  the  magnitude  of  the  correlations 
with the C-Score was higher for cognitive ability (MAB) than for personality traits (NEO PI-R).

Primary  training.  Although  there  was  some  variability,  the  magnitude  of  the  correlations 
between the MAB and NEO PI-R scores with the C-Score by years was consistent and mirrored
the results summed across years. Overall, the magnitude of the correlations was higher for
cognitive  ability  compared  to  personality  traits. 

Advanced  T-38  training.  Again,  the  results  broken  out  by  year  were  consistent  with  those 
accumulated  across  years  of  training.  The  magnitude  of  the  correlations  with  the  C-Score  was 
higher  for  cognitive  ability  than  personality  traits. 

Advanced  T-1  training.  Consistent  with  previous  analyses,  overall,  the  magnitude  of  the 
correlations  with  the  C-Score  was  higher  for  cognitive  ability  than  for  personality  traits.  Further, 
there  was  little  variability  by  year. 

Given the large number of year-to-year comparisons made, the number of statistically significant
differences was extremely small (5.6% across training tracks). This result illustrates the
consistency of pilot selection methods and standards and their effect on trainee quality (cognitive
ability and personality traits) over time. With pilot trainee characteristics this stable, fewer
disruptions and adjustments are needed and the training system is more stable, enabling greater
efficiency and effectiveness.

There  were  more  year-to-year  differences  noted  in  the  C-Score.  One  possible  explanation  is 
fluctuation  in  managed  attrition  rates  as  projected  manpower  needs  are  adjusted  by  pilot  training 
managers.  Another  possible  source  of  score  fluctuation  is  variation  in  the  application  of  scoring 
criteria  due  to  turnover  in  instructor  pilots.  More  research  is  needed  to  investigate  variability  in 
C-Scores  over  time. 


^T-1 training began in 2005. Prior to 2005 a different aircraft was used in airlift/tanker training.

Results for personality trait measures were consistent with meta-analytic studies regarding
the predictiveness of commonly used selection methods both for pilot training (Hunter & Burke,
1994; Martinussen, 1996) and in the broader context of personnel selection (Schmidt & Hunter,
1998).


CONCLUSIONS  AND  RECOMMENDATIONS 

Although the observed validities for cognitive ability were small (Cohen, 1988), six of nine
correlations between the cognitive test scores and training criterion (see Table 4) were in the moderate
range (.3 < r < .5) after correction for range restriction and reliability. The observed and corrected
validities for personality traits were small and were consistent with previous studies (Anesgart &
Callister, 2001; Campbell et al., 2010; D. R. Hunter & Burke, 1994; Martinussen, 1996; Siem,
1992). There were few differences across training tracks, bases, and years, and none was large.
The relative strength of the validities for the cognitive and personality trait measures was consistent
with meta-analytic studies regarding the predictiveness of commonly used selection methods
both for pilot training (D. R. Hunter & Burke, 1994; Martinussen, 1996) and in the broader
context of personnel selection (Schmidt & Hunter, 1998).

The  role  of  cognitive  ability  in  pilot  training  has  been  to  facilitate  the  acquisition  of  pilot  job 
knowledge  and  flying  skills  (Ree,  Carretta,  &  Teachout,  1995).  The  acquisition  of  knowledge  and 
skill  in  early  pilot  training  has  been  shown  to  facilitate  further  knowledge  and  skills  acquisition  in 
later  training.  Path  and  structural  equation  models  (Ree  et  al.,  1995)  showed  the  direct  and  indirect 
effects  of  cognitive  ability  on  the  acquisition  of  pilot  job  knowledge  and  flying  skills.  These  direct 
and  indirect  effects  probably  account  for  the  smaller  validity  coefficients  for  cognitive  ability  in 
advanced  training  in  this  study.  Additional  studies  are  needed  to  examine  the  role  of  personality 
traits  in  the  acquisition  of  pilot  job  knowledge  and  flying  skills. 

Overall, these results convey two notable messages. First, consistent with prior studies, measures
of cognitive ability and personality traits are important determinants of pilot training
success. Second, the quality of USAF pilot trainees has been remarkably consistent across training
tracks and training locations over a 14-year period. This is likely a function of the availability
of sufficient numbers of high-quality applicants to fill available training positions and consistency
in selection and training methods. These two messages are important for improving pilot selection
and for practical application by decision makers involved in setting selection and training
requirements and evaluating pilot training applicant suitability.

Improving  Selection 

The corrected validities were in the moderate range, suggesting that there is a substantial proportion
of criterion variance remaining to be predicted. The total amount of criterion variance that can
be predicted is limited by external influences that might not be predictable. Student performance
varies in pilot training for several reasons, not all of which are related to ability or personality
traits. Some students could have personal problems that interfere with training performance.
Others might have strong support from family that fortifies their training performance. These
and other outside influences should not be expected to be predicted by either cognitive ability or
personality traits (Ree & Carretta, 1999).

Despite these limitations that reduce the magnitude of predictive relationships with pilot training
outcomes, current USAF selection and classification methods do not leverage measures of
cognitive ability and personality traits in an optimal manner to predict the remaining criterion
variance. Although cognitive ability is represented in USAF pilot trainee selection methods such
as the AFOQT and PCSM, measures of personality traits are not. Also, neither measures of cognitive
ability nor personality traits are considered when making advanced training assignments.
To this end, we recommend that studies be conducted to examine the incremental validity of
personality measures for USAF pilot training qualification when used in combination with the
PCSM score and measures of pilot aptitude. Further, they should be examined to determine their
utility in improving advanced training assignments when used in combination with preliminary
training performance, instructor ratings, and student preferences. Finally, measures of psychomotor
performance should be included, as should measures of aviation-job knowledge and flying
experience (Carretta & Ree, 2003).

Having  good  predictors  is  necessary  but  not  sufficient  for  an  optimal  selection  system.  The 
criteria must be free of contamination and deficiency. As with predictors, criterion measures
should  be  evaluated  for  evidence  of  construct  validity.  The  identification  of  good  criteria  is  just 
as  important  as  the  identification  of  good  predictors. 

Practical  Applications 

This study demonstrated that pilot trainee quality and training performance were consistent over
training track, training location, and time. The high quality of pilot trainees as assessed by cognitive
ability and personality trait measures and the consistency of these measures in predicting
training performance over time enable the consistent production of high-quality pilots. This stability
in the selection and training system has multiple benefits. Importantly, Air Force decision
makers can rely on this stability for making policy, setting selection and training standards, and
longer term planning activities (e.g., pilot production requirements). In addition, in the military
aviation training system, consistency in trainee quality helps stabilize training methods (e.g.,
course content, instructional approaches, time and resources required to train students to meet
rigorous standards). This enables the organization to meet its production goals (i.e., number of
graduates) more efficiently and effectively over time.